In this scenario, a nonprofit has data on homeless shelters in several states and wants to know where to direct its limited resources for the most impact. The data comes from Kaggle.

We begin by summarizing the data and checking whether it is missing any values. We also count how many states the data covers. The summary reports no NA values in the numeric columns, and the data covers six states.

rawData <- read.csv("homelessness_shelter_data.csv")
summary(rawData)
##        id             date           shelter_name           city          
##  Min.   :   1.0   Length:1000        Length:1000        Length:1000       
##  1st Qu.: 250.8   Class :character   Class :character   Class :character  
##  Median : 500.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 500.5                                                           
##  3rd Qu.: 750.2                                                           
##  Max.   :1000.0                                                           
##     state           total_capacity  occupied_beds    available_beds  
##  Length:1000        Min.   : 50.0   Min.   :  0.00   Min.   :  0.00  
##  Class :character   1st Qu.:115.0   1st Qu.: 38.00   1st Qu.: 35.00  
##  Mode  :character   Median :182.0   Median : 76.00   Median : 71.00  
##                     Mean   :179.1   Mean   : 91.87   Mean   : 87.25  
##                     3rd Qu.:243.0   3rd Qu.:136.00   3rd Qu.:128.25  
##                     Max.   :300.0   Max.   :294.00   Max.   :296.00  
##  occupancy_rate    average_age    male_percentage female_percentage
##  Min.   :  0.00   Min.   :18.00   Min.   :40.00   Min.   :30.00    
##  1st Qu.: 26.77   1st Qu.:30.00   1st Qu.:47.00   1st Qu.:38.00    
##  Median : 51.85   Median :42.00   Median :55.00   Median :45.00    
##  Mean   : 51.21   Mean   :42.04   Mean   :54.63   Mean   :45.37    
##  3rd Qu.: 76.80   3rd Qu.:54.00   3rd Qu.:62.00   3rd Qu.:53.00    
##  Max.   :100.00   Max.   :65.00   Max.   :70.00   Max.   :60.00    
##     season             notes          
##  Length:1000        Length:1000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
length(unique(rawData$state))
## [1] 6
unique(rawData$state)
## [1] "TX" "CA" "AZ" "IL" "NY" "PA"
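
For character columns, summary() only reports length, class, and mode, so it cannot reveal missing values there. A minimal sketch of a more explicit check follows. The `toy` data frame is a made-up stand-in for rawData, and treating empty strings as missing is an assumption about how gaps might be encoded.

```r
# Toy stand-in for rawData; the values are made up for illustration only.
toy <- data.frame(
  city = c("Chicago", "", "San Jose", NA),
  occupancy_rate = c(80.5, NA, 42.0, 61.3),
  stringsAsFactors = FALSE
)

# NA count per column (covers numeric and character columns alike)
na_counts <- colSums(is.na(toy))

# Empty-string count per character column, which is.na() does not catch
char_cols <- names(toy)[sapply(toy, is.character)]
empty_counts <- sapply(toy[char_cols], function(x) sum(x == "", na.rm = TRUE))

print(na_counts)
print(empty_counts)
```

Running the same two checks on rawData would confirm whether columns such as city, season, or notes contain gaps.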

Next, we see how each city holds up in terms of occupancy rate and total available beds.
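
The code behind these plots is not shown, so the following is only a sketch of one way to build a city-level summary and draw a bar plot per measure. The `toy` data frame and its city names are made up, and averaging occupancy_rate while summing available_beds is an assumption about how the figures were aggregated.

```r
library(dplyr)

# Toy stand-in for rawData; the values are made up for illustration only.
toy <- data.frame(
  city = c("City A", "City A", "City B", "City B", "City C"),
  occupancy_rate = c(90, 70, 60, 80, 30),
  available_beds = c(10, 30, 40, 20, 100)
)

# One row per city: mean occupancy rate and total available beds
city_summary <- toy %>%
  group_by(city) %>%
  summarise(
    mean_occupancy = mean(occupancy_rate),
    total_available = sum(available_beds),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_occupancy))

# One bar plot per measure, drawn with base graphics
barplot(city_summary$mean_occupancy, names.arg = city_summary$city,
        ylab = "Mean occupancy rate (%)")
barplot(city_summary$total_available, names.arg = city_summary$city,
        ylab = "Total available beds")
```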



We can compare that with the cities that currently have the highest occupancy rate.



I have a theory that demand for the shelters changes by season; winter, for example, may see more demand. I also want to see the distribution of the data, so I use a violin plot. I really like these because they show a lot of information at once.
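
As a rough sketch, a violin plot by season could be drawn with ggplot2 as follows. The data here is randomly generated, the season labels are assumed, and using occupancy_rate as the measure of demand is also an assumption, since the original plotting code is not shown.

```r
library(ggplot2)

# Randomly generated stand-in data; not the Kaggle data set.
set.seed(1)
toy <- data.frame(
  season = rep(c("Winter", "Spring", "Summer", "Autumn"), each = 25),
  occupancy_rate = runif(100, min = 0, max = 100)
)

# Violin shows the shape of each distribution; the narrow box plot
# overlays the median and quartiles.
p <- ggplot(toy, aes(x = season, y = occupancy_rate)) +
  geom_violin(fill = "grey85") +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  labs(x = "Season", y = "Occupancy rate (%)")
print(p)
```

Overlaying a box plot is a design choice rather than a requirement: it makes the median easy to compare across seasons while the violin shows where the observations cluster.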



It doesn’t seem like demand really changes by season. It does, however, seem that a subgroup of the data has much larger values than the rest, as indicated by the mean being higher than the median. This is what we want to identify: our nonprofit has limited resources and wants to give them to the shelter(s) most in need.
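
To illustrate why a mean above the median points to a subgroup of large values, here is a small worked example with made-up numbers. In the summary output above, occupied_beds shows the same pattern (mean 91.87 against median 76.00).

```r
# Made-up values: most are moderate, a few are very large
occupied <- c(20, 25, 30, 35, 40, 45, 50, 200, 250, 280)

mean_occ <- mean(occupied)      # pulled upward by the three large values
median_occ <- median(occupied)  # unaffected by how large they are

# The observations responsible for the gap
large_values <- occupied[occupied > mean_occ]

print(c(mean = mean_occ, median = median_occ))
print(large_values)
```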

We’ll start with a higher-level view and identify the cities that have the highest occupancy rate on average. A better way to view the bar plots above is to combine them into a single plot, which makes the information easier to read: we can see availability and occupancy rate together.
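
The type of combined plot used here is not shown. One possible sketch is a labeled scatter plot with availability on one axis and occupancy on the other; the city names and values below are made up.

```r
library(ggplot2)

# Made-up city-level summary for illustration only.
city_summary <- data.frame(
  city = c("City A", "City B", "City C"),
  mean_occupancy = c(80, 70, 30),
  total_available = c(40, 60, 100)
)

# Cities of most concern sit toward the top left:
# few available beds and a high occupancy rate.
p <- ggplot(city_summary,
            aes(x = total_available, y = mean_occupancy, label = city)) +
  geom_point(size = 3) +
  geom_text(vjust = -1) +
  labs(x = "Total available beds", y = "Mean occupancy rate (%)")
print(p)
```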




We see that Chicago, San Jose, and New York have low availability and generally high occupancy. That lets us narrow our scope further and look at how the occupancy rate has changed over time for shelters in these cities.
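
A minimal sketch of the aggregation behind such a time series: average the occupancy rate per city and month. The rows below are made up, and the dates are assumed to be ISO-formatted strings that as.Date() can parse without a format argument.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  city = c("Chicago", "Chicago", "Chicago", "San Jose", "San Jose"),
  date = c("2024-01-05", "2024-01-20", "2024-02-10", "2024-01-15", "2024-02-01"),
  occupancy_rate = c(40, 60, 70, 30, 50)
)

monthly <- toy %>%
  # Collapse each date to the first day of its month
  mutate(month = as.Date(format(as.Date(date), "%Y-%m-01"))) %>%
  group_by(city, month) %>%
  summarise(mean_occupancy = mean(occupancy_rate), .groups = "drop")

print(monthly)
```

The resulting table has one row per city and month, which can be passed to a line plot with one line per city.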



It seems the beginning of 2025 saw an increase in all three cities. The data is quite variable month to month, so patterns are hard to see, but the same increase doesn’t seem to occur at the beginning of 2024. Of the three cities, Chicago stands out: its average occupancy rate has doubled since this time in 2023. Next, we’ll dive into Chicago’s shelters specifically.

Let’s see how demand at Chicago shelters changes by season.



Oddly enough, there seems to be higher demand in Spring and less in Summer.

We still need to narrow this down, as there are many shelters in Chicago. Let’s see which ones have had the largest increases in their average occupancy rate.

library(dplyr)

rawData %>%
  filter(city == "Chicago") %>%
  # Average duplicate records so each shelter has one rate per date
  group_by(shelter_name, date) %>%
  mutate(occupancy_rate = mean(occupancy_rate)) %>%
  distinct(shelter_name, date, .keep_all = TRUE) %>%
  # Compare each shelter's earliest and latest observations
  group_by(shelter_name) %>%
  mutate(date = as.Date(date),
         firstOR = occupancy_rate[date == min(date)],
         lastOR = occupancy_rate[date == max(date)],
         change = ((lastOR - firstOR) / firstOR) * 100  # percent change
         ) %>%
  distinct(shelter_name, change) %>%
  arrange(desc(change))
## # A tibble: 10 × 2
## # Groups:   shelter_name [10]
##    shelter_name       change
##    <chr>               <dbl>
##  1 Safe Haven         1626. 
##  2 Sunrise Shelter     218. 
##  3 Shelter Plus        128. 
##  4 Recovery Residence   58.7
##  5 Second Chance        25.3
##  6 Harbor Home          10.1
##  7 Pathway Place       -11.9
##  8 New Beginnings      -16.3
##  9 HomeSafe            -60.4
## 10 Hope House          -65.6
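
A percentage change is sensitive to its baseline: a value as large as 1626% can arise when the first observation is close to zero. Whether that is the case here is not visible in the output. As a sketch, reporting the first and last rates and the change in percentage points next to the percent change makes this easy to check. The shelter names and values below are made up.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  shelter_name = c("Shelter A", "Shelter A", "Shelter B", "Shelter B"),
  date = as.Date(c("2023-01-01", "2025-01-01", "2023-01-01", "2025-01-01")),
  occupancy_rate = c(5, 85, 50, 60)
)

change_tbl <- toy %>%
  arrange(shelter_name, date) %>%
  group_by(shelter_name) %>%
  summarise(
    first_rate = first(occupancy_rate),
    last_rate = last(occupancy_rate),
    pct_change = (last_rate - first_rate) / first_rate * 100,
    point_change = last_rate - first_rate,  # change in percentage points
    .groups = "drop"
  ) %>%
  arrange(desc(pct_change))

print(change_tbl)
```

In this toy example, Shelter A rises from 5 to 85, a 1600% increase that corresponds to 80 percentage points, while Shelter B's 20% increase corresponds to 10 points.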


And here we have the shelters that have seen the most demand without having enough capacity. They all seem to follow the 2025 increase, which falls outside the season in which they usually see more demand.

Next steps would be to potentially predict how much demand these shelters will see, although this data set does not contain enough data to do so accurately. We could also look at which shelters hit their occupancy limits season by season, which may change our decision.
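
As a starting point for the season-by-season check, the sketch below counts how often each shelter is near capacity in each season. The data is made up, and the 90% cutoff (`near_cap_threshold`) is a hypothetical choice rather than a value taken from the data set.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(
  shelter_name = c("Shelter A", "Shelter A", "Shelter A",
                   "Shelter B", "Shelter B", "Shelter B"),
  season = c("Winter", "Winter", "Summer", "Winter", "Summer", "Summer"),
  occupancy_rate = c(95, 100, 60, 40, 92, 50)
)

near_cap_threshold <- 90  # hypothetical cutoff for "near capacity"

at_limit <- toy %>%
  group_by(shelter_name, season) %>%
  summarise(
    n_obs = n(),
    n_near_capacity = sum(occupancy_rate >= near_cap_threshold),
    share_near_capacity = n_near_capacity / n_obs,
    .groups = "drop"
  )

print(at_limit)
```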